This is an R Markdown
Notebook. When you execute code within the notebook, the results appear
beneath the code.
Try executing this chunk by clicking the Run button within
the chunk or by placing your cursor inside it and pressing
Cmd+Shift+Enter.
Install packages
# install.packages("readr")
# install.packages("dplyr")
# install.packages("stringr")
# install.packages("shiny")
# install.packages("ggplot2")
# install.packages("plotly")
Load in packages
# Allows us to read-in csv files
library(readr)
# For data manipulation
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# For regular expression operations
library(stringr)
# library(shiny)
library(ggplot2)
# Used tp create interactive visualisations
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
Load-in dataset
df <- read_csv('Data/GI_age.csv')
Rows: 42 Columns: 7── Column specification ────────────────────────────────────
Delimiter: ","
chr (4): England and Wales Code, England and Wales, Gend...
dbl (3): Gender identity (7 categories) Code, Age (6 cat...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Brief glimpse of data structure
# But can also click on the dataset in the Environment pane
head(df, 10)
# Let's check out the dimensions
dim(df)
[1] 42 7
Data Cleaning
# str_replace_all() method finds all substrings which match the regex and replaces them with empty string
# First, let's replace any brackets with empty strings
colnames(df) <- str_replace_all(colnames(df), "\\s*\\([^)]*\\)", "")
# Lowercase column text and replace empty spaces with "_"
colnames(df) <- tolower(colnames(df))
colnames(df) <- str_replace_all(colnames(df), " ", "_")
# Let's see if it worked..
head(df)
Pipes and other operators..
So, we’ve already come across the assignment operator ‘<-’ which
is used to assign a value. E.g. df <- read_csv(‘Data/GI_age.csv’),
here we assign our csv file to a dataframe variable called ‘df’.
But, we’re now going to encounter the pipe operator ‘%>%’ which
can seem intimidating at first but is actually pretty simple. It’s used
to pass the result of one function directly into the next function. E.g.
df <- df %>% filter(gender_identity_code != -8), here we start
with our df and pass it to the filter function using the pipe operator.
This basically supplies the filter() function with its first argument,
which is the dataframe to filter on. And here we encounter a logical
operator ‘!=’ within the filter() function, which specifies that we
should only keep rows where gender_identity_code is not equal to -8.
# Get rid of columns with 0 observations
df <- df %>%
filter(gender_identity_code != -8)
# Check it worked
head(df, 10)
# Get rid of redundant age category
# Further filter data
df <- df %>%
filter(age_code != 1)
# Clean up the values in the 'age' column. Let's shorten them.
# Chain str_replace() calls together to apply multiple string replacements in succession
# Each str_replace() call is applied to the result of the previous one
df$age <- df$age %>%
str_replace('Aged ', '') %>%
str_replace('to', '-') %>%
str_replace('years', '') %>%
str_replace('and over', '+')
# We can pass our df to the select function, where we specify the column we're interested in.
# Then, we pipe the output to the head function.
df %>%
select(age) %>%
head()
Question
How is gender identity distributed among different age groups?
Some subquestions that this can help us answer:
- What % of trans men are aged 16-24 years?
- Are older age groups overrepresented in the ‘non-response’
category?
Data pre-processing
Calculate percentages
Below, we use the group_by function to group the data by
‘gender_identity’ and calculate the percentage within each group. Then
the mutate() function adds a new column ‘percentage’ to df, which (for
each group) divides the observation by the sum of observations,
multiplies it by 100, and rounds it up to 2 decimal points. We then use
the ungroup function when we’re done with the grouping operation.
df <- df %>%
group_by(gender_identity) %>%
mutate(percentage = round((observation / sum(observation) * 100), 2)) %>%
ungroup()
head(df)
# Directly convert to a factor with the specified order
df$gender_identity <- factor(df$gender_identity, levels = c(
"Gender identity the same as sex registered at birth",
"Gender identity different from sex registered at birth but no specific identity given",
"Trans woman",
"Trans man",
"All other gender identities",
"Not answered"
))
# Print the levels to ensure they are correct
print(levels(df$gender_identity))
[1] "Gender identity the same as sex registered at birth"
[2] "Gender identity different from sex registered at birth but no specific identity given"
[3] "Trans woman"
[4] "Trans man"
[5] "All other gender identities"
[6] "Not answered"
Interactive grouped bar chart + stacked bar chart
So, the convention when using Plotly in R, is to create our plot
first by using the ggplot2 package. Then, we convert the ggplot object
to a ‘plotly’ object using ‘ggplotly’. There’s a lot going on here so
I’ll break some of it down. The ggplot() function initialises a ggplot
object, which sets up the dataframe that will be used for the plot and
specifies the aesthetic mappings which describe how variables in the
data are mapped to visual properties. So, inside aes() we specify our x
and y columns, and specify that we want to map our age column to fill
the colour of the bars.
Meanwhile, geom_bar() is used to make bar charts, so it adds the bar
geometry to the plot. And we set stat to ‘identity’, which tells
‘ggplot’ to use the value in the y-axis column (‘percentage’) for the
height of the bars. By setting position to ‘dodge’ we ensure that the
bars are placed next to each other.
Finally, labs() is used to add or modify labels, and theme is used to
customise non-data parts of the plot like text, legend, axes. And
scale_fill_discrete() controls the colour scales and here we use the
name parameter to label our legend “Age”.
TLDR: we’re using the + operator and ggplot functions to build upon
the base ggplot object, layering on aesthetic mappings, geometries,
labels, etc.
p <- ggplot(df, aes(x = gender_identity, y = percentage, fill = age,
text = paste('Observation:', observation))) + # Include observation info
geom_bar(stat = "identity", position = "dodge") +
labs(title = 'Distribution of Gender Identity Categories Among Age Groups',
x = 'Gender Identity', y = 'Percentage') +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_discrete(name = "Age")
# Let's take a look at our static graph
p

Hmm okay. Not too shabby, but we’re definitely going to have to do
something about our x-axis labels, as right now everything is pretty
cluttered. Maybe we could rotate them, or just rename them. We’ll get
round to it. But for now, let’s make this thing interactive.
# Convert ggplot object to a plotly object for interactivity
fig <- ggplotly(p, tooltip = c("y", "fill", "text"), width = 700, height = 500) # Specify tooltip components
# Let's check it out
fig
NA
tickvals <- 1:length(levels(df$gender_identity))
# Custom tick labels corresponding to the levels
ticktext <- c(
"Cisgender",
"Gender identity different from sex",
"Trans woman",
"Trans man",
"All other identities",
"Not answered"
)
Dataset 2
# Load in dataset
df2 <- read_csv('Data/GI_ethnic.csv')
Rows: 10592 Columns: 7── Column specification ────────────────────────────────────
Delimiter: ","
chr (4): Lower tier local authorities Code, Lower tier l...
dbl (3): Gender identity (4 categories) Code, Ethnic gro...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Brief glimpse at underlying data structure
head(df2, 10)
# Remove all text within parentheses from column names and replace it with an empty string
# tilde operator (~) used to apply function 'gsub' to each colname
# .x represents each colname that gsub will be applied to
df2 <- df2 %>%
rename_with(~ gsub("\\s*\\([^)]*\\)", "", .x))
# Lowercase all text in column names and replace spaces with underscores
df2 <- df2 %>%
rename_with(~ tolower(gsub(" ", "_", .x)))
# Shorten the local authority column names as they are way too long
df2 <- df2 %>%
rename(LA_code = lower_tier_local_authorities_code,
LA_name = lower_tier_local_authorities)
# Let's see if it worked
colnames(df2)
[1] "LA_code" "LA_name"
[3] "gender_identity_code" "gender_identity"
[5] "ethnic_group_code" "ethnic_group"
[7] "observation"
# Remove 'Does not apply' categories for the gender identity and ethnic group columns
df2 <- df2 %>%
filter(gender_identity_code != -8, ethnic_group_code != -8)
# Display the first 30 rows
head(df2, 30)
Data pre-processing
Calculate % of each ethnic group in each LA
# First, we're going to group our data by LA_name, Ethnic group, and sum our observations
# This leaves us with the total of each ethnic group in each local authority
ethnic_totals <- df2 %>%
group_by(LA_name, ethnic_group) %>%
summarise(Ethnic_sum = sum(observation, na.rm = TRUE)) %>%
ungroup()
`summarise()` has grouped output by 'LA_name'. You can override using the `.groups` argument.
# Print the first few rows to check
head(ethnic_totals)
# Calculate total observations for each local authority
la_totals <- df2 %>%
group_by(LA_name) %>%
summarise(LA_sum = sum(observation, na.rm = TRUE)) %>%
ungroup()
# Print the first few rows to check
head(la_totals)
# Merge the summed ethnic group data with the total LA observations to calculate percentage of each ethnic group in each LA
merged <- merge(ethnic_totals, la_totals, by = "LA_name")
# Calculate the percentage of each ethnic group within each local authority
merged <- merged %>%
mutate(Percentage = round((Ethnic_sum / LA_sum * 100), 2))
# Print the first few rows to check
head(merged, 10)
Calculate Non-Response Rates Within LAs
# Now calculate the non-response % for each ethnic group in each LA
# This involves grouping by LA_name, ethnic_group, and summing observations again
ethnic_group_totals <- df2 %>%
group_by(LA_name, ethnic_group) %>%
summarise(Ethnic_group_total = sum(observation, na.rm = TRUE)) %>%
ungroup()
`summarise()` has grouped output by 'LA_name'. You can override using the `.groups` argument.
# This is the sum of non-responses for each ethnic group within each LA
non_response_totals <- df2 %>%
filter(gender_identity == 'Not answered') %>%
group_by(LA_name, ethnic_group) %>%
summarise(Non_response_total = sum(observation, na.rm = TRUE)) %>%
ungroup()
`summarise()` has grouped output by 'LA_name'. You can override using the `.groups` argument.
head(non_response_totals)
# Merge the totals with the non-response totals
merged_data <- merge(ethnic_group_totals, non_response_totals, by = c("LA_name", "ethnic_group"), all.x = TRUE)
head(merged_data)
# Calculate the non-response percentage for each ethnic group within each LA
merged_data <- merged_data %>%
mutate(Eth_NR_Perc = round((Non_response_total / Ethnic_group_total * 100), 2))
head(merged_data)
# Merge the non-response data with the percentage of each ethnic group within each LA
nr <- merge(merged_data, select(merged, LA_name, ethnic_group, Percentage), by = c("LA_name", "ethnic_group"))
head(nr)
Interactive scatterplot
In this section we’re going to:
Create a simple scatterplot exploring the relationship between
the percentage of asian citizens within local authorities and their
non-response rates
Implement a widget to update our scatterplot
# Subset dataframe so we only have responses from the asian ethnic group
asian <- nr %>%
filter(ethnic_group == 'Asian, Asian British or Asian Welsh')
head(asian)
# Initialize figure
fig <- plot_ly(data = asian,
x = ~Percentage,
y = ~Eth_NR_Perc,
text = ~paste('LA Name:', LA_name,
'<br>Percentage:', sprintf("%.2f", Percentage),
'<br>Non-response Rate:', sprintf("%.2f%%", Eth_NR_Perc),
'<br>Non-response Total:', Non_response_total,
'<br>Ethnic Group Total:', Ethnic_group_total),
hoverinfo = "text",
mode = 'markers', # Specify marker points
type = 'scatter', # Graph type - scatterplot
name = 'Asian') # Default visible graph
# Customize layout
fig <- fig %>%
layout(title = 'Non-Response Rates of the Asian Ethnic Group Across Local Authorities',
xaxis = list(title = 'Percentage of Ethnic Group'),
yaxis = list(title = 'Non-response Rate'),
width = 900,
height = 900)
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
# Show the plot
fig
Dropdown selection
What we’re going to do now, is use Plotly’s ‘updatemenus’ in
conjunction with the ‘update’ method to create a dropdown where we can
switch between the Asian ethnic group, and White.
Step 1: Initialise figure and add traces
We’ll start by creating a plot_ly figure. We use plotly figures here
instead of ggplotly, because plot_ly objects offer more control over how
plots are constructed. It allows us to add ‘traces’, which refer to a
set of data. In our example, we want to add a trace with the data points
relating to our asian ethnic group, and another one for our white ethnic
group. This will start to make sense when we look at the code below.
# Initialize a Plotly figure
fig <- plot_ly()
fig
Warning: No trace type specified and no positional attributes specifiedNo trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Warning: No trace type specified and no positional attributes specifiedNo trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
# Add trace for the Asian ethnic group
fig <- fig %>% add_trace(
data = nr[nr$ethnic_group == 'Asian, Asian British or Asian Welsh',],
x = ~Percentage,
y = ~Eth_NR_Perc,
text = ~paste('LA Name:', LA_name),
type = 'scatter',
mode = 'markers',
name = 'Asian',
hoverinfo = 'text+x+y',
visible = T
)
# Add trace for the White ethnic group
fig <- fig %>% add_trace(
data = nr[nr$ethnic_group == 'White: English, Welsh, Scottish, Northern Irish or British',],
x = ~Percentage,
y = ~Eth_NR_Perc,
text = ~paste('LA Name:', LA_name),
type = 'scatter',
mode = 'markers',
name = 'White',
hoverinfo = 'text+x+y',
visible = F
)
fig
# Define dropdown buttons for interactivity
fig <- fig %>% layout(
title = "Non-Response Rates Across Local Authorities",
xaxis = list(title = "Percentage of Ethnic Group"),
yaxis = list(title = "Non-response Rate"),
showlegend = FALSE,
updatemenus = list(
list(
type = "dropdown",
buttons = list(
list(
method = "update",
args = list(list("visible" = list(TRUE, FALSE)),
list("title" = "Non-Response Rates of the Asian Ethnic Group Across Local Authorities")),
label = "Asian"
),
list(
method = "update",
args = list(list("visible" = list(FALSE, TRUE)),
list("title" = "Non-Response Rates of the White Ethnic Group Across Local Authorities")),
label = "White"
)
)
)
)
)
# Display the figure
fig
Add a new chunk by clicking the Insert Chunk button on the
toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and
output will be saved alongside it (click the Preview button or
press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the
editor. Consequently, unlike Knit, Preview does not
run any R code chunks. Instead, the output of the chunk when it was last
run in the editor is displayed.
---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

## Install packages

```{r}
# install.packages("readr")
# install.packages("dplyr")
# install.packages("stringr")
# install.packages("shiny")
# install.packages("ggplot2")
# install.packages("plotly")
```

## Load in packages

```{r}
# Allows us to read-in csv files
library(readr) 
# For data manipulation
library(dplyr) 
# For regular expression operations 
library(stringr) 
# library(shiny)
library(ggplot2)
# Used tp create interactive visualisations
library(plotly)
```
## Load-in dataset

```{r}
df <- read_csv('Data/GI_age.csv')
```
```{r}
# Brief glimpse of data structure
# But can also click on the dataset in the Environment pane
```


```{r}
head(df, 10)
```

```{r}
# Let's check out the dimensions

dim(df)
```

## Data Cleaning

```{r}
# str_replace_all() method finds all substrings which match the regex and replaces them with empty string
# First, let's replace any brackets with empty strings
colnames(df) <- str_replace_all(colnames(df), "\\s*\\([^)]*\\)", "")

# Lowercase column text and replace empty spaces with "_"
colnames(df) <- tolower(colnames(df))
colnames(df) <- str_replace_all(colnames(df), " ", "_")

# Let's see if it worked..
head(df)
```

### Pipes and other operators..

So, we've already come across the assignment operator '<-' which is used to assign a value. E.g. df <- read_csv('Data/GI_age.csv'), here we assign our csv file to a dataframe variable called 'df'.

But, we're now going to encounter the pipe operator '%>%' which can seem intimidating at first but is actually pretty simple. It's used to pass the result of one function directly into the next function. E.g. df <- df %>% filter(gender_identity_code != -8), here we start with our df and pass it to the filter function using the pipe operator. This basically supplies the filter() function with its first argument, which is the dataframe to filter on. And here we encounter a logical operator '!=' within the filter() function, which specifies that we should only keep rows where gender_identity_code is not equal to -8. 

```{r}
# Get rid of columns with 0 observations
df <- df %>% 
  filter(gender_identity_code != -8) 

# Check it worked

head(df, 10)
```

```{r}
# Get rid of redundant age category
# Further filter data
df <- df %>%
  filter(age_code != 1)

```

```{r}
# Clean up the values in the 'age' column. Let's shorten them.

# Chain str_replace() calls together to apply multiple string replacements in succession
# Each str_replace() call is applied to the result of the previous one
df$age <- df$age %>%
  str_replace('Aged ', '') %>%
  str_replace('to', '-') %>%
  str_replace('years', '') %>%
  str_replace('and over', '+')

# We can pass our df to the select function, where we specify the column we're interested in.
# Then, we pipe the output to the head function.
df %>%
  select(age) %>%
  head()
```

## Question

How is gender identity distributed among different age groups?

Some subquestions that this can help us answer:

* What % of trans men are aged 16-24 years?
* Are older age groups overrepresented in the 'non-response' category?

## Data pre-processing

### Calculate percentages 

Below, we use the group_by function to group the data by 'gender_identity' and calculate the percentage within each group. Then the mutate() function adds a new column 'percentage' to df, which (for each group) divides the observation by the sum of observations, multiplies it by 100, and rounds it up to 2 decimal points. We then use the ungroup function when we're done with the grouping operation. 

```{r}
df <- df %>%
  group_by(gender_identity) %>%
  mutate(percentage = round((observation / sum(observation) * 100), 2)) %>%
  ungroup()

head(df)
```

```{r}
# Directly convert to a factor with the specified order
df$gender_identity <- factor(df$gender_identity, levels = c(
  "Gender identity the same as sex registered at birth",
  "Gender identity different from sex registered at birth but no specific identity given",
  "Trans woman",
  "Trans man",
  "All other gender identities",
  "Not answered"
))
```

```{r}
# Print the levels to ensure they are correct
print(levels(df$gender_identity))

```


## Interactive grouped bar chart + stacked bar chart

So, the convention when using Plotly in R, is to create our plot first by using the ggplot2 package. Then, we convert the ggplot object to a 'plotly' object using 'ggplotly'. There's a lot going on here so I'll break some of it down. The ggplot() function initialises a ggplot object, which sets up the dataframe that will be used for the plot and specifies the aesthetic mappings which describe how variables in the data are mapped to visual properties. So, inside aes() we specify our x and y columns, and specify that we want to map our age column to fill the colour of the bars.

Meanwhile, geom_bar() is used to make bar charts, so it adds the bar geometry to the plot. And we set stat to 'identity', which tells 'ggplot' to use the value in the y-axis column ('percentage') for the height of the bars. By setting position to 'dodge' we ensure that the bars are placed next to each other. 

Finally, labs() is used to add or modify labels, and theme is used to customise non-data parts of the plot like text, legend, axes. And scale_fill_discrete() controls the colour scales and here we use the name parameter to label our legend "Age". 

TLDR: we're using the + operator and ggplot functions to build upon the base ggplot object, layering on aesthetic mappings, geometries, labels, etc.

```{r}
p <- ggplot(df, aes(x = gender_identity, y = percentage, fill = age,
                    text = paste('Observation:', observation))) +  # Include observation info
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = 'Distribution of Gender Identity Categories Among Age Groups',
       x = 'Gender Identity', y = 'Percentage') +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete(name = "Age")

# Let's take a look at our static graph
p
```

Hmm okay. Not too shabby, but we're definitely going to have to do something about our x-axis labels, as right now everything is pretty cluttered. Maybe we could rotate them, or just rename them. We'll get round to it. But for now, let's make this thing interactive.

```{r}
# Convert ggplot object to a plotly object for interactivity
fig <- ggplotly(p, tooltip = c("y", "fill", "text"), width = 700, height = 500)  # Specify tooltip components


# Let's check it out
fig

```

```{r}
tickvals <- 1:length(levels(df$gender_identity))

# Custom tick labels corresponding to the levels
ticktext <- c(
  "Cisgender", 
  "Gender identity different from sex",
  "Trans woman",
  "Trans man",
  "All other identities",
  "Not answered"
)

```

## Tooltips 

When using different R libraries geared towards interactive visualisations, you'll often come across 'tooltips'. These are small boxes that provide information when a user hovers over a part of a data visualisation such as: a point on a graph, a bar in a bar chart, or a segment in a pie chart. They are used to display additional information about the data point or object, providing more context without cluttering up the chart. 


```{r}
# Specify custom tick labels with the corresponding tick values
fig <- fig %>%
  layout(
    title = list(text = 'Distribution of Gender Identity Categories Among Age Groups', x = 0.5),
    xaxis = list(
      title = list('Gender Identity',
      standoff = 15),
      tickmode = "array",
      tickvals = tickvals,
      ticktext = ticktext,
      tickangle = -45
    ),
    yaxis = list(title = list('Percentage',
                 standoff = 25)))

fig
```

## Dataset 2

```{r}
# Load in dataset
df2 <- read_csv('Data/GI_ethnic.csv')
```
```{r}
# Brief glimpse at underlying data structure
head(df2, 10)
```

```{r}
# Remove all text within parentheses from column names and replace it with an empty string

# tilde operator (~) used to apply function 'gsub' to each colname
# .x represents each colname that gsub will be applied to
df2 <- df2 %>% 
  rename_with(~ gsub("\\s*\\([^)]*\\)", "", .x))
```

```{r}
# Lowercase all text in column names and replace spaces with underscores
df2 <- df2 %>% 
  rename_with(~ tolower(gsub(" ", "_", .x)))
```

```{r}
# Shorten the local authority column names as they are way too long
df2 <- df2 %>% 
  rename(LA_code = lower_tier_local_authorities_code,
         LA_name = lower_tier_local_authorities)

```

```{r}
# Let's see if it worked
colnames(df2)
```
```{r}
# Remove 'Does not apply' categories for the gender identity and ethnic group columns
df2 <- df2 %>% 
  filter(gender_identity_code != -8, ethnic_group_code != -8)
```

```{r}
# Display the first 30 rows
head(df2, 30)
```

## Data pre-processing

### Calculate % of each ethnic group in each LA

```{r}
# First, we're going to group our data by LA_name, Ethnic group, and sum our observations
# This leaves us with the total of each ethnic group in each local authority
ethnic_totals <- df2 %>%
  group_by(LA_name, ethnic_group) %>%
  summarise(Ethnic_sum = sum(observation, na.rm = TRUE)) %>%
  ungroup()

# Print the first few rows to check
head(ethnic_totals)
```
```{r}
# Calculate total observations for each local authority
la_totals <- df2 %>%
  group_by(LA_name) %>%
  summarise(LA_sum = sum(observation, na.rm = TRUE)) %>%
  ungroup()

# Print the first few rows to check
head(la_totals)
```

```{r}
# Merge the summed ethnic group data with the total LA observations to calculate percentage of each ethnic group in each LA
merged <- merge(ethnic_totals, la_totals, by = "LA_name")
```

```{r}
# Calculate the percentage of each ethnic group within each local authority
merged <- merged %>%
  mutate(Percentage = round((Ethnic_sum / LA_sum * 100), 2))
```


```{r}
# Print the first few rows to check
head(merged, 10)
```

### Calculate Non-Response Rates Within LAs

```{r}
# Now calculate the non-response % for each ethnic group in each LA
# This involves grouping by LA_name, ethnic_group, and summing observations again
ethnic_group_totals <- df2 %>%
  group_by(LA_name, ethnic_group) %>%
  summarise(Ethnic_group_total = sum(observation, na.rm = TRUE)) %>%
  ungroup()
```

```{r}
# This is the sum of non-responses for each ethnic group within each LA
non_response_totals <- df2 %>%
  filter(gender_identity == 'Not answered') %>%
  group_by(LA_name, ethnic_group) %>%
  summarise(Non_response_total = sum(observation, na.rm = TRUE)) %>%
  ungroup()
```

```{r}
head(non_response_totals)
```
```{r}
# Merge the totals with the non-response totals
merged_data <- merge(ethnic_group_totals, non_response_totals, by = c("LA_name", "ethnic_group"), all.x = TRUE)


head(merged_data)
```

```{r}
# Calculate the non-response percentage for each ethnic group within each LA
merged_data <- merged_data %>%
  mutate(Eth_NR_Perc = round((Non_response_total / Ethnic_group_total * 100), 2))
```


```{r}
head(merged_data)
```

```{r}
# Merge the non-response data with the percentage of each ethnic group within each LA
nr <- merge(merged_data, select(merged, LA_name, ethnic_group, Percentage), by = c("LA_name", "ethnic_group"))

head(nr)
```

## Interactive scatterplot

In this section we're going to:

1. Create a simple scatterplot exploring the relationship between the percentage of asian citizens within local authorities and their non-response rates

2. Implement a widget to update our scatterplot

```{r}
# Subset dataframe so we only have responses from the asian ethnic group

asian <- nr %>%
  filter(ethnic_group == 'Asian, Asian British or Asian Welsh')

head(asian)
```

```{r}
# Initialize figure
fig <- plot_ly(data = asian,
               x = ~Percentage,
               y = ~Eth_NR_Perc,
               text = ~paste('LA Name:', LA_name, 
                             '<br>Percentage:', sprintf("%.2f", Percentage),
                             '<br>Non-response Rate:', sprintf("%.2f%%", Eth_NR_Perc),
                             '<br>Non-response Total:', Non_response_total,
                             '<br>Ethnic Group Total:', Ethnic_group_total),
               hoverinfo = "text",
               mode = 'markers',  # Specify marker points
               type = 'scatter',  # Graph type - scatterplot
               name = 'Asian')  # Default visible graph


# Customize layout 
fig <- fig %>%
  layout(title = 'Non-Response Rates of the Asian Ethnic Group Across Local Authorities',
         xaxis = list(title = 'Percentage of Ethnic Group'),
         yaxis = list(title = 'Non-response Rate'),
         width = 900,
         height = 900)

# Show the plot
fig
```

## Dropdown selection

What we're going to do now, is use Plotly's 'updatemenus' in conjunction with the 'update' method to create a dropdown where we can switch between the Asian ethnic group, and White. 

### Step 1: Initialise figure and add traces

We'll start by creating a plot_ly figure. We use plotly figures here instead of ggplotly, because plot_ly objects offer more control over how plots are constructed. It allows us to add 'traces', which refer to a set of data. In our example, we want to add a trace with the data points relating to our asian ethnic group, and another one for our white ethnic group. This will start to make sense when we look at the code below. 

```{r}
# Initialize a Plotly figure
fig <- plot_ly()

fig

```

```{r}
# Add trace for the Asian ethnic group
fig <- fig %>% add_trace(
  data = nr[nr$ethnic_group == 'Asian, Asian British or Asian Welsh',],
  x = ~Percentage,
  y = ~Eth_NR_Perc,
  text = ~paste('LA Name:', LA_name),
  type = 'scatter',
  mode = 'markers',
  name = 'Asian',
  hoverinfo = 'text+x+y',
  visible = T
)

# Add trace for the White ethnic group
fig <- fig %>% add_trace(
  data = nr[nr$ethnic_group == 'White: English, Welsh, Scottish, Northern Irish or British',],
  x = ~Percentage,
  y = ~Eth_NR_Perc,
  text = ~paste('LA Name:', LA_name),
  type = 'scatter',
  mode = 'markers',
  name = 'White',
  hoverinfo = 'text+x+y',
  visible = F
)

fig
```


```{r}

# Define dropdown buttons for interactivity
fig <- fig %>% layout(
  title = "Non-Response Rates Across Local Authorities",
  xaxis = list(title = "Percentage of Ethnic Group"),
  yaxis = list(title = "Non-response Rate"),
  showlegend = FALSE,
  updatemenus = list(
    list(
      type = "dropdown",
      buttons = list(
        list(
          method = "update",
          args = list(list("visible" = list(TRUE, FALSE)),
                      list("title" = "Non-Response Rates of the Asian Ethnic Group Across Local Authorities")),
          label = "Asian"
        ),
        list(
          method = "update",
          args = list(list("visible" = list(FALSE, TRUE)),
                      list("title" = "Non-Response Rates of the White Ethnic Group Across Local Authorities")),
          label = "White"
        )
      )
    )
  )
)

# Display the figure
fig
```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file). 

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

